pkgs_needed = c(
"tidyverse",
"dbscan", "GGally", "pheatmap",
"flowCore","flowViz","flowPeaks", "ggcyto"
)
letsinstall = setdiff(pkgs_needed, installed.packages())
if (length(letsinstall) > 0) {
for (pkg in letsinstall)
BiocManager::install(pkg)
}Clustering
Goal
In this lab we will learn the basics of clustering. The methods we will cover include hierarchical clustering, K means and density clustering.
Packages
Install and load packages.
Hierarchical clustering
The Morder data are gene expression measurements for 156 genes on T cells of 3 types (naïve, effector, memory) from 10 patients (Holmes et al. 2005).
Here we load the Morder data.frame from the online directory.
load(url("http://web.stanford.edu/class/bios221/data/Morder.RData"))
# if download fails with timeout error, try increasing it
# options(timeout = 1000)If you downloaded the file before, you can load it from the local directory.
load("Morder.RData")Inspect the data. How many samples of each T cell type (naïve, effector, memory) are there in the data?
In base R the function to perform hierarchical clustering is hclust. To cluster the genes with hierarchical clustering you first need to compute a distance matrix storing all pairwise (gene-to-gene) dissimilarities. This is how to do it:
# distance calculation
D = dist(t(Morder))
# clustering
gene_clust = hclust(d = D)
# plot dendrogram
plot(gene_clust)Why in the provided code the input to dist function is t(Morder)?
In hierarchical clustering one needs to choose the method for agglomerating the clusters. By default hclust uses a “complete” linkage method (see ?hclust for info).
Redo hierarchical clustering with the ward.D2 method and plot the dendrogram.
Notice that the values on the y axis of hclust dendrogram changed. What do they correspond to?
Note that in the hclust there are the ward.D and ward.D2 methods available. Please call ?hclust to read about the difference between the two methods.
Next, instead of clustering genes, we will apply hierarchical clustering for samples (observations)
Use dist and hclust (with default linkage method) to cluster samples.
How many clusters of samples are there at the dendrogram height of 12? Hint: the abline function might be helpful.
Now that you know how to perform hierarchical clustering, use pheatmap to generate a heatmap with clustered rows and columns. Below, we do some extra work compared to the default parameters to make sure that 0 is in the center of the color scale, and use beautiful colors.
library("pheatmap")
mycolors = colorRampPalette(c("#FFD800", "#0056B9"))(100)
pheatmap(
mat = Morder,
fontsize_col = 5, fontsize_row = 10,
color = mycolors,
breaks = max(abs(Morder)) * seq(-1, +1, length.out = length(mycolors) + 1)
) What type of distance and which clustering method does pheatmap use by default for clustering rows and columns?
Note that these are default values for dist and hclust, too. Look at how clustering heatmap changes if you use different distance and clustering methods (e.g. ‘ward.D2’).
Mass cytometry (CyTOF)
Cytometry is a biophysical technology that allows you to measure physical and chemical characteristics of cells. Modern flow and mass cytometry allows for simultaneous multiparametric analysis of thousands of particles per second.
Flow cytometry enables the simultaneous measurement of 15, whereas mass cytometry (CyTOF) of as many as 40 proteins per single cell.
We will be working with the CyTOF dataset from a single-cell mass cytometry study by Bendall et al. on differential immune drug responses across human hematopoietic cells over time.
First we load the necessary packages.
library("tidyverse")
library("flowCore")
library("flowViz")
library("flowPeaks")
library("ggcyto")Here we download the data.
# download file from web (14.3 Mb)
download.file(
url = "http://web.stanford.edu/class/bios221/data/Bendall_2011.fcs",
destfile = "Bendall_2011.fcs",
mode = "wb"
)
# If download fails with timeout error, try increasing it
# options(timeout = 1000)
# On Linux and MacOS, you can also try to use the 'wget' command in the Unix terminal insteadLoad this data in R using flowCore package.
# load file in R - here we use a function from flowCore package
fcsB = flowCore::read.FCS("Bendall_2011.fcs")Look at the structure of the fcsB object. How many variables and how many cells were measured?
Inspect slots of the flowFrame object:
slotNames(fcsB)[1] "exprs" "parameters" "description"
Different slots of the flowFrame objects can be accessed with exprs(fcsB), parameters(fcsB) and description(fcsB) functions.
Data preprocessing
First we get the data table that reports the mapping between isotopes and markers (antibodies).
Here we download the table.
# download data from the web
markersB = read_csv(url("http://web.stanford.edu/class/bios221/data/Bendall_2011_markers.csv"))If you already downloaded the data you can load it locally.
# load data
markersB = read_csv("Bendall_2011_markers.csv")Now we replace the isotope names in the column names of fcsB with the marker names. This is simply to make the subsequent analysis and plotting code more readable.
# match isotope names to marker names
mt = match(markersB$isotope, colnames(fcsB))
# check that all isotope names were matched
stopifnot(!any(is.na(mt)))
# replace isotope names with marker names
colnames(fcsB)[mt] = markersB$markerBelow, we show how to plot the joint distribution of the cell lengths and the DNA191 (this variable indicates the activity of the cell, i.e. whether the cell is dead or alive):
flowPlot(fcsB, plotParameters = c("Cell_length", "DNA191"), logy = TRUE)It is standard to transform both flow and mass cytometry data using one of several special functions, we take the example of the inverse hyperbolic sine (arcsinh), which serves as a variance stabilizing transformation.
First, we show the distribution on untransformed raw data:
densityplot(~`CD3all`, fcsB)To apply the transformation and to plot the data you can use functions from the flowCore package. After the transformation the cells seem to form two clusters. Based solely on one dimension (CD3all) we see two cell subsets (the two modes).
asinhT = arcsinhTransform(a = 0.1, b = 1)
cols_to_transform = setdiff(colnames(fcsB), c("Time", "Cell_length", "absoluteEventNumber"))
trans1 = transformList(cols_to_transform, asinhT)
fcsBT = transform(fcsB, trans1)
densityplot(~`CD3all`, fcsBT)\(k\)-means
Let’s cluster cells into two groups using one-dimensional \(k\)-means filter. To learn more about the arguments of the functions type ?kmeansFilter and ?flowCore::filter
kf = kmeansFilter("CD3all"=c("Pop1","Pop2"), filterId="myKmFilter")
fres = flowCore::filter(fcsBT, kf)
summary(fres)Pop1: 33429 of 91392 events (36.58%)
Pop2: 57963 of 91392 events (63.42%)
Split two data clusters into two separate flowFrame objects:
fcsBT1 = split(fcsBT, fres, population="Pop1")
fcsBT2 = split(fcsBT, fres, population="Pop2")We can also cluster cells using k-means with the flowPeaks function from flowPeaks package. The algorithm for this clustering algorithm is specified in detail in a paper by Ge Y. et al (2012).
dat = data.frame(exprs(fcsBT)[ , c("CD56", "CD3all")])
fp = flowPeaks(dat)
Starting the flow Peaks analysis...
Task A: compute kmeans...
step 0, set the intial seeds, tot.wss=17597.8
step 1, do the rough EM, tot.wss=11900.7 at 0.096195 sec
step 2, do the fine transfer of Hartigan-Wong Algorithm
tot.wss=11846.3 at 0.221722 sec
...finished summarization at 0.328 sec
Task B: find peaks...
finished at 0.35 sec
plot(fp)How many dimensions (markers) does the above code use to split the data into 4 cell subsets using kmeans?
Note that here fcsB or fcsBT are not ‘data.frames’ but objects of class ‘flowFrame’. This meansthat you cannot use fcsB and fcsBT (without conversion to data.frame) as inputs to ggplot. However, ‘flowFrame’ objects hold marker expression data together with sample information data, so you could access any variables you need for plotting directly from these objects (and this would be lost by converting them to data.frame). This is why for plottimg now we use a Bioconductor package ggcyto, built on top of ggplot2. This package includes functions for generating visualizations specifically for cytometry data, starting from ‘flowFrame’ objects.
We can look at distributions of measured markers, e.g CD4, by plotting the histogram:
library("ggcyto")
# Untransformed data
ggcyto(fcsB,aes(x = CD4)) + geom_histogram(bins = 90) # Transformed data
ggcyto(fcsBT, aes(x = CD4)) + geom_histogram(bins = 90) autoplot function plots the density of selected markers.
# ggcyto automatic plotting
autoplot(fcsBT, "CD4")We can compare the distributions of two markers, e.g. CD4 and CD8.
# 2d density plot
ggcyto(fcsBT, aes(x = CD4, y = CD8)) + geom_density2d(colour="black") # hexbin plot
ggcyto(fcsBT, aes(x = CD4, y = CD8)) + geom_hex(bins = 50) The same can be achieved with autoplot for two variables.
# ggcyto automatic plotting
autoplot(fcsBT, "CD4", "CD8", bins = 50)For more details on capabilities of ggcyto refer to the following link
Density based clustering
Data sets such as flow cytometry containing only a few variables (markers) and a large number of observations (cells) are amenable to density clustering. DBSCAN algorithm looks for regions of high density separated by sparse emptier regions. This method has the advantage of being able to cope with clusters that are not necessarily convex (i.e. not blob-shaped). One implementation of such a method is called dbscan, let us look at an example by running the code below.
library("dbscan")
# Select a small subset of 5 protein markers
mc5 = exprs(fcsBT)[, c("CD4", "CD8", "CD20", "CD3all", "CD56")]
res5 = dbscan::dbscan(mc5, eps = .65, minPts = 30)
mc5df = data.frame(mc5)
mc5df$cluster = as.factor(res5$cluster)How many clusters did dbscan find?
Note that we do not count the 0 as a cluster, since it contains all cells that were not assigned to any cluster. Looking up the documentation of dbscan (?dbscan) we find that “Points which cannot be assigned to a cluster will be reported as members of the noise cluster 0.”
How many cells were clustered into cluster 3 by dbscan?
We can now generate a CD8-vs-CD4 2d-density plot for the cells colored by their assigned cluster labels, computed by dbscan:
ggplot(mc5df, aes(x = CD4, y = CD8, colour = cluster)) + geom_density2d()And do the same for CD3all and CD20 markers:
ggplot(mc5df,aes(x = CD3all, y = CD20, colour = cluster))+ geom_density2d()Observe that the nature nature of the clustering is multidimensional, as the projections into two dimensions show overlapping clusters.
Validating and choosing the number of clusters
The clustering methods we have described are tailored to deliver the best grouping of the data under various constrains, however they will always deliver groups, even if there are none. This is important, e.g. when performing kmeans clustering, as we have to set the ‘k’ parameter (for the number of clusters to group observations into) ahead of time. What choice of ‘k’ is valid though?
Here we want to illustate the use of the “wss” (within sum of squares) statistic to evaluate the quality of a clustering. Note that as \(k\) (number of cluster for k-means algorithm) increases, wss will also decrease. We simulate data coming from 4 groups. In particular, we generate 2-dimensional observations (as if there were only 2 proteins measured for each cell). The four groups are generated from 2-d multivariate normals with centers at \(\mu_1 = (0, 0)\), \(\mu_2 = (0, 8)\), \(\mu_3 = (8, 0)\), \(\mu_4 = (8, 8)\). In this simulation, we know the ground truth (4 groups), but we will try to cluster the data using the kmeans argorithm with different choices for the k parameter. We will see how the wss statistic varies as we vary k.
We have used the |> operator (if you do not understand the code, try to see what simul4 contains and repeat the same using code that does not use the |> operator).
simul4 = lapply(c(0,8), \(mx) {
lapply(c(0, 8), \(my) {
tibble(x = rnorm(100, mx, 2),
y = rnorm(100, my, 2),
class = paste0(mx, my)
)
}) |> bind_rows()
}) |> bind_rows()ggplot(simul4, aes(x = x, y = y)) +
geom_point(aes(color = class), size = 2)# Compute the kmeans within group wss for k=1 to 12
wss = rep(0,8)
# for a single cluster the WSS statistic is just sum of squares of centered data
wss[1] = sum(apply(scale(simul4[,1:2], scale = F), 2, function(x){ x^2 }))
# for k = 2, 3, ... we perform kmeans clustering and compute the associated WSS statistic
for (k in 2:8) {
km4 = kmeans(simul4[,1:2],k)
wss[k] = sum(km4$withinss)
}
# Now, we are ready to plot the computed statistic:
ggplot(data.frame(k = 1:length(wss), wss = wss)) +
geom_point(aes(x = k, y = wss), color = "blue", size= 3) +
xlab('k') + ylab('WSS(k)')Within sum of squares (wss) statistic, we see that the last substantial decrease of the statistic occurres before \(k=4\), and for values \(k = 5, 6, \dots\) the quantity ‘levels-off’. In practice, we would choose \(k=4\), a value happening at the ‘elbow’ of the plot (elbow-rul). Of course this choice is still somewhat subjective. The book chapter describes additional ways of choosing k (e.g. the gap statistic).
Session info
sessionInfo()R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Uzhgorod
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dbscan_1.2-0 ggcyto_1.32.0 flowWorkspace_4.16.0
[4] ncdfFlow_2.50.0 BH_1.84.0-0 flowPeaks_1.50.0
[7] flowViz_1.68.0 lattice_0.22-6 flowCore_2.16.0
[10] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[13] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[16] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[19] tidyverse_2.0.0 pheatmap_1.0.12
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2 fastmap_1.2.0
[4] XML_3.99-0.17 digest_0.6.36 timechange_0.3.0
[7] lifecycle_1.0.4 magrittr_2.0.3 compiler_4.4.1
[10] rlang_1.1.4 tools_4.4.1 utf8_1.2.4
[13] yaml_2.3.9 data.table_1.15.4 knitr_1.48
[16] labeling_0.4.3 htmlwidgets_1.6.4 interp_1.1-6
[19] bit_4.0.5 plyr_1.8.9 RColorBrewer_1.1-3
[22] KernSmooth_2.23-24 withr_3.0.0 RProtoBufLib_2.16.0
[25] BiocGenerics_0.50.0 grid_4.4.1 stats4_4.4.1
[28] fansi_1.0.6 latticeExtra_0.6-30 colorspace_2.1-0
[31] scales_1.3.0 MASS_7.3-61 isoband_0.2.7
[34] cli_3.6.3 rmarkdown_2.27 crayon_1.5.3
[37] generics_0.1.3 tzdb_0.4.0 IDPmisc_1.1.21
[40] zlibbioc_1.50.0 parallel_4.4.1 matrixStats_1.3.0
[43] vctrs_0.6.5 jsonlite_1.8.8 cytolib_2.16.0
[46] hms_1.1.3 S4Vectors_0.42.1 bit64_4.0.5
[49] Rgraphviz_2.48.0 jpeg_0.1-10 hexbin_1.28.3
[52] glue_1.7.0 stringi_1.8.4 gtable_0.3.5
[55] deldir_2.0-4 munsell_0.5.1 pillar_1.9.0
[58] htmltools_0.5.8.1 graph_1.82.0 R6_2.5.1
[61] vroom_1.6.5 evaluate_0.24.0 Biobase_2.64.0
[64] png_0.1-8 Rcpp_1.0.12 gridExtra_2.3
[67] xfun_0.45 pkgconfig_2.0.3